Data Brew: Brewing Success with Starbucks Customer Data

TEAM 6 - Chekitha Swayampu, Hrushikesh Sai Seshagiri Chowdary Uppalapati, Swathi Murali Srinivasan, Vaishnavi Tamilvanan

ABSTRACT

This project analyzes Starbucks’ marketing strategies with three main goals: maximizing the effectiveness of promotional activities, improving overall customer satisfaction, and increasing customer engagement. Our analysis involved a deep dive into Starbucks’ customer data, using a variety of visualizations and statistical methods to uncover insights.

Our initiative’s primary goals were to increase customer satisfaction and engagement, determine the best channels for promotions, and tailor offers to each individual customer’s needs. To accomplish this, we used a variety of modeling methods, including k-means clustering, logistic regression, decision trees, and support vector machines (SVM). These models were crucial in helping us understand consumer preferences and behavior. We expect our strategy to improve the efficacy and efficiency of Starbucks’ promotional campaigns, supporting sustained customer satisfaction and market leadership. The full analysis is documented in an R Markdown file in our GitHub repository.

INTRODUCTION

Our project examines the interactions and transactions of 17,000 customers using the Starbucks Customer Dataset. We apply K-Means clustering for customer segmentation, drawing on data from responses (transcript.csv), customer demographics (profile.csv), and offers (portfolio.csv). This method identifies distinct behavioral and demographic patterns, enabling more precisely targeted strategies. By analyzing purchase behaviors and offer responsiveness, we aim to improve customer engagement and promotional efficacy at Starbucks.

SUMMARY OF DATASET

The data is contained in three files:

  • portfolio.csv - data about offers sent to customers (10 offers x 6 columns)

  • profile.csv - demographic data of customers (17,000 customers x 5 columns)

  • transcript.csv - customer responses to offers and transactions made (306,648 events x 4 columns)

DATASET OVERVIEW

# Load data
portfolio <- read.csv("data/portfolio.csv", row.names = 1)
profile <- read.csv("data/profile.csv", row.names = 1)
transcript <- read.csv("data/transcript.csv", row.names = 1)

BEFORE CLEANING

PORTFOLIO

head(portfolio)
reward channels difficulty duration offer_type id
0 10 [‘email’, ‘mobile’, ‘social’] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [‘web’, ‘email’, ‘mobile’, ‘social’] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [‘web’, ‘email’, ‘mobile’] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [‘web’, ‘email’, ‘mobile’] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [‘web’, ‘email’] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
5 3 [‘web’, ‘email’, ‘mobile’, ‘social’] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2

PROFILE

head(profile)
gender age id became_member_on income
0 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NA
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000
2 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NA
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000
4 118 a03223e636434f42ac4c3df47e8bac43 20170804 NA
5 M 68 e2127556f4f64592b11af22de27a7932 20180426 70000

TRANSCRIPT

head(transcript)
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {‘offer id’: ‘9b98b8c7a33c4b65b9aebfe6a799e6d9’} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {‘offer id’: ‘0b1e1539f2cc45b7b9fa7c272da2e1d7’} 0
2 e2127556f4f64592b11af22de27a7932 offer received {‘offer id’: ‘2906b810c7d4411798c6938adc9daaa5’} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {‘offer id’: ‘fafdcd668e3743c1bb461111dcafc2a4’} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {‘offer id’: ‘4d5c57ea9a6940dd891ad53e9dbe8da0’} 0
5 389bc3fa690240e798340f5a15918d5c offer received {‘offer id’: ‘f19421c1d4aa40978ebb69ca19b0e20d’} 0

DATA CLEANING

# Expand "channels" into binary columns of all different channels in the dataset (email, web, mobile, social)

library(dplyr)
library(stringr)

# Create binary columns for each channel
channels_list <- c('email', 'web', 'mobile', 'social')

portfolio_channels <- portfolio %>%
  mutate(email = as.numeric(str_detect(channels, 'email')),
         web = as.numeric(str_detect(channels, 'web')),
         mobile = as.numeric(str_detect(channels, 'mobile')),
         social = as.numeric(str_detect(channels, 'social')))


# Create binary columns for each offer type
portfolio_offertype <- portfolio %>%
  mutate(bogo = as.numeric(offer_type == 'bogo'),
         informational = as.numeric(offer_type == 'informational'),
         discount = as.numeric(offer_type == 'discount'))

# Combine the channel and offer-type indicator columns; joining only the new
# indicator columns avoids the duplicated .x/.y columns that merging the full
# data frames by "id" would create
merged_portfolio <- portfolio_channels %>%
  left_join(select(portfolio_offertype, id, bogo, informational, discount), by = "id")

new_portfolio <- merged_portfolio %>%
  select(reward, difficulty, duration, offer_type, id, bogo, discount, informational, email, mobile, social, web)

unique_ids <- unique(new_portfolio$id)
id_mapping <- setNames(seq_along(unique_ids), unique_ids)

new_portfolio$id <- id_mapping[new_portfolio$id]

# Checking for null values in each column
col_sums_null <- colSums(is.na(new_portfolio))

duplicated_rows <- new_portfolio[duplicated(new_portfolio), ]

PROFILE

unique_ids <- unique(profile$id)
id_mapping <- setNames(seq_along(unique_ids), unique_ids)

profile$id <- id_mapping[profile$id]
na_counts <- colSums(is.na(profile))
cat("ORIGINAL NA VALUES:", na_counts)
## ORIGINAL NA VALUES: 0 0 0 0 2175
# Remove rows with NA values and Age equal to 118
profile_new <- subset(profile, !is.na(age) & age != 118)

na_counts_after_cleaning <- colSums(is.na(profile_new))
cat("Count of NA values after removing:", na_counts_after_cleaning)
## Count of NA values after removing: 0 0 0 0 0
profile <- na.omit(profile_new)

profile_new$became_member_on <- as.Date(as.character(profile_new$became_member_on), format = "%Y%m%d")

duplicated_rows <- profile_new[duplicated(profile_new), ]

TRANSCRIPT

library(dplyr)

# Extract offer_id from the 'value' column
transcript <- transcript %>%
  mutate(offer_id = ifelse(grepl("'offer id'", value),
                           gsub(".*'offer id':\\s*'([[:alnum:]]+)'.*", "\\1", value),
                           ifelse(grepl("'offer_id'", value),
                                  gsub(".*'offer_id':\\s*'([[:alnum:]]+)'.*", "\\1", value),
                                  NA)))
# Create amount and reward_given columns; the 'value' strings use single
# quotes, and str_match() extracts just the captured number
transcript <- transcript %>%
  mutate(amount = coalesce(as.numeric(str_match(value, "'amount':\\s*([0-9.]+)")[, 2]), 0),
         reward_given = coalesce(as.numeric(str_match(value, "'reward':\\s*([0-9]+)")[, 2]), 0))

# Remove value column
transcript <- select(transcript, -value)
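As a quick sanity check, the capture-group pattern used above can be tried on a hypothetical value string (the offer id below is taken from the portfolio preview):

```r
# Verify the pattern pulls out just the offer id from a sample 'value' string
sample_value <- "{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}"
extracted <- gsub(".*'offer id':\\s*'([[:alnum:]]+)'.*", "\\1", sample_value)
extracted  # "9b98b8c7a33c4b65b9aebfe6a799e6d9"
```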

if (!requireNamespace("digest", quietly = TRUE)) {
  install.packages("digest")
}
library(digest)


# Function to convert person value to an integer
map_person_to_int <- function(person_value) {
  # Calculate the hash value using SHA-256
  hash_value <- digest(person_value, algo = "sha256", serialize = FALSE)
  
  # Convert the hash value to a numeric representation
  person_integer <- sum(as.integer(charToRaw(hash_value)))
  
  return(person_integer)
}

transcript$person <- sapply(transcript$person, map_person_to_int)
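One caveat with the hashing approach above: the byte-sum of a 64-character hex digest can only take a few thousand distinct values, so with 17,000 customers some ids are guaranteed to collide, and the resulting integers will not match the seq_along-based ids assigned to `profile`. A collision-free sketch using `match()`, mirroring the mapping used elsewhere in this report:

```r
# Map each person string to the position of its first occurrence; this is
# collision-free and consistent with the setNames(seq_along(...)) mapping
# used for 'profile' (the strings below are hypothetical values)
person_ids <- c("abc", "def", "abc", "ghi")
person_int <- match(person_ids, unique(person_ids))
person_int  # 1 2 1 3
```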

# Create a mapping dictionary
unique_ids <- unique(transcript$offer_id)
id_mapping <- setNames(seq_along(unique_ids), unique_ids)

transcript$offer_id <- id_mapping[transcript$offer_id]


library(dplyr)

# Create binary columns for each event
event_list <- c('offer completed', 'offer received', 'offer viewed', 'transaction')

transcript_new <- transcript %>%
  mutate(offer_completed = as.numeric(event == 'offer completed'),
         offer_received = as.numeric(event == 'offer received'),
         offer_viewed = as.numeric(event == 'offer viewed'),
         transaction = as.numeric(event == 'transaction'))

# Display the updated data frame
head(transcript_new)
person event time offer_id amount reward_given offer_completed offer_received offer_viewed transaction
0 4182 offer received 0 1 0 0 0 1 0 0
1 4547 offer received 0 2 0 0 0 1 0 0
2 4272 offer received 0 3 0 0 0 1 0 0
3 4550 offer received 0 4 0 0 0 1 0 0
4 4161 offer received 0 5 0 0 0 1 0 0
5 4438 offer received 0 6 0 0 0 1 0 0

PORTFOLIO

library(dplyr)

# Rename columns in the 'portfolio' data frame
new_portfolio <- new_portfolio %>%
  rename(offer_id = id, offer_reward = reward)

transcript1 <- read.csv("data/transcript.csv", row.names =1)
profile1 <- read.csv("data/profile.csv", row.names = 1)
profile1$gender <- as.factor(profile1$gender)

# Impute missing values in 'income' with mean
profile1$income[is.na(profile1$income)] <- mean(profile1$income, na.rm = TRUE)

# Remove any non-numeric characters
profile1$became_member_on <- gsub("[^0-9]", "", profile1$became_member_on)

profile1$became_member_on <- as.Date(profile1$became_member_on, format = "%Y%m%d")
profile1$membership_duration <- as.numeric(difftime(Sys.Date(), profile1$became_member_on, units = "days"))
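Because `membership_duration` is computed from `Sys.Date()`, the values change every time the report is knitted. A minimal sketch of a reproducible alternative, assuming an arbitrary fixed reference date:

```r
# Pin a fixed reference date (an assumed example value) so the computed
# membership durations are reproducible across runs
reference_date <- as.Date("2023-12-01")
membership_duration <- as.numeric(reference_date - as.Date("20170212", format = "%Y%m%d"))
```

In the pipeline above, `profile1$membership_duration <- as.numeric(reference_date - profile1$became_member_on)` would replace the `Sys.Date()` call.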

library(jsonlite)

# Extract offer id from 'value' column
transcript1$value <- gsub("'", "\"", transcript1$value)  # Replace single quotes with double quotes
transcript1$offer_id <- sapply(transcript1$value, function(x) {
  parsed_value <- fromJSON(x, simplifyVector = TRUE)
  if (!is.null(parsed_value) && 'offer id' %in% names(parsed_value)) {
    return(parsed_value[['offer id']])
  } else {
    return(NA)
  }
})

# Create binary columns for different events
transcript1$offer_received <- as.integer(transcript1$event == "offer received")
transcript1$offer_viewed <- as.integer(transcript1$event == "offer viewed")
transcript1$offer_completed <- as.integer(transcript1$event == "offer completed")
transcript1$transaction <- as.integer(transcript1$event == "transaction")

# Rename columns in the 'transcript' data frame
transcript_new <- transcript_new %>%
  rename(user_id = person)

# Rename columns in the 'profile' data frame
profile_new <- profile_new %>%
  rename(user_id = id)

# Left join on 'offer_id'
full_df <- left_join(transcript_new, new_portfolio, by = 'offer_id')

# Inner join on 'user_id'
full_df <- inner_join(full_df, profile_new, by = 'user_id')

head(new_portfolio)
offer_reward difficulty duration offer_type offer_id bogo discount informational email mobile social web
5 20 10 discount 1 0 1 0 1 0 0 1
3 7 7 discount 2 0 1 0 1 1 1 1
2 10 7 discount 3 0 1 0 1 1 0 1
0 0 4 informational 4 0 0 1 1 1 0 1
10 10 5 bogo 5 1 0 0 1 1 1 1
0 0 3 informational 6 0 0 1 1 1 1 0

MERGED DATA

# Merge profile and transcript datasets based on 'id' (customer ID)
merged_data <- merge(profile1, transcript1, by.x = "id", by.y = "person", all.x = TRUE)

merged_data <- merge(merged_data, portfolio, by.x = "offer_id", by.y = "id", all.x = TRUE)

head(merged_data)
offer_id id gender age became_member_on income membership_duration event value time offer_received offer_viewed offer_completed transaction reward channels difficulty duration offer_type
0b1e1539f2cc45b7b9fa7c272da2e1d7 1314cb7b712649af9acf1ac19aadd267 118 2017-11-10 65405 2224 offer received {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} 504 1 0 0 0 5 [‘web’, ‘email’] 20 10 discount
0b1e1539f2cc45b7b9fa7c272da2e1d7 2b91e3c00a424185ab261b2a204c9718 M 58 2017-09-16 53000 2279 offer received {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} 168 1 0 0 0 5 [‘web’, ‘email’] 20 10 discount
0b1e1539f2cc45b7b9fa7c272da2e1d7 dffe25b1294f4166a72f73f57926214e F 61 2017-10-31 77000 2234 offer viewed {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} 348 0 1 0 0 5 [‘web’, ‘email’] 20 10 discount
0b1e1539f2cc45b7b9fa7c272da2e1d7 bd920b7b033d42fb9d97be92343a9aef F 52 2017-01-12 97000 2526 offer received {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} 0 1 0 0 0 5 [‘web’, ‘email’] 20 10 discount
0b1e1539f2cc45b7b9fa7c272da2e1d7 9504588a36ed4c868759a7092cdfa7b1 118 2018-02-07 65405 2135 offer viewed {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} 360 0 1 0 0 5 [‘web’, ‘email’] 20 10 discount
0b1e1539f2cc45b7b9fa7c272da2e1d7 cc3ad71378a240a0841158e9aefe046e M 25 2017-02-20 35000 2487 offer received {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} 408 1 0 0 0 5 [‘web’, ‘email’] 20 10 discount

EXPLORATORY DATA ANALYSIS

Age and Gender

# Create a side-by-side boxplot and histogram
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
boxplot(profile_new$age, xlab = "Age", main = "Boxplot", col = "lightblue")
hist(profile_new$age, xlab = "Age", main = "Histogram", col = "lightblue")

# Adjust axis label sizes
par(cex.lab = 1.5)

# Print descriptive statistics
summary(profile_new$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    42.0    55.0    54.4    66.0   101.0
age_sd <- sd(profile_new$age)

# Print the standard deviation
cat("Standard Deviation of Age:", age_sd, "\n")
## Standard Deviation of Age: 17.4
  1. The customer age range spans from 18 years as the youngest to 101 years as the oldest.

  2. The distribution of customer ages appears to approximate a normal distribution, with a mean and standard deviation of approximately 54 and 17, respectively.

Gender distribution.

library(ggplot2)
library(plotly)

# Create a data frame with the count of each gender category
gender_counts <- table(profile$gender)

# Calculate percentages
gender_percentages <- round((gender_counts / sum(gender_counts)) * 100, 1)

# Define custom colors
custom_colors <- c("#FF6F61", "#6B5B95", "#88B04B")  # You can use any color codes you like

# Create the 3D pie chart
pie_chart <- plot_ly(
  labels = names(gender_counts),
  values = gender_counts,
  type = "pie",
  textinfo = "label+percent",
  marker = list(colors = custom_colors),
  pull = c(0.1, 0.1, 0.2)  # Adjust pull for exploding wedges
) %>%
  layout(
    title = "Gender Distribution",
    scene = list(
      aspectmode = "cube",  # Center the chart
      camera = list(eye = list(x = 1.25, y = 1.25, z = 0.85))  # 3D view settings
    ),
    showlegend = FALSE
  )

# Display the 3D pie chart
pie_chart
  1. The customer base consists of a larger proportion of males (57.2%) compared to females (41.3%), with a minor representation (1.4%) from customers identifying with other genders.

Data distribution among different events.

library(ezids)
# Get value counts for the 'event' column
event_value_counts <- table(transcript_new$event)
# EDA on event occurrences
# Load the required libraries
library(ggplot2)
library(plotly)

# Create a data frame from the event counts
event_counts_df <- data.frame(event = names(event_value_counts), count = as.numeric(event_value_counts))

# Define a vector of colors for  four events
event_colors <- c("Coral", "Cyan", "Magenta", "Turquoise")

# Create a ggplot2 bar chart 
p <- ggplot(event_counts_df, aes(x = event, y = count, fill = event)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = event_colors) +
  labs(title = "Event Distribution", x = "Event", y = "Count") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, max(event_counts_df$count), by = 20000)) +
  coord_flip() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        panel.grid = element_blank(),
        axis.text.x = element_text(face = "bold", color = "black"),
        axis.text.y = element_text(face = "bold", color = "black")) 

# Make the ggplot2 chart interactive using plotly
interactive_plot <- ggplotly(p)

# Display the interactive plot
interactive_plot

As expected, not all received offers were viewed, and not all received offers were completed. The dataset contains 45% transaction events and 55% offer events.

Percentages of each offer type sent.

colors <- c("#ad6a6c", "#d0ada7", "#e8d6cb")

# Count the frequency of each offer type
offer_counts <- table(new_portfolio$offer_type)

ggplot(data = as.data.frame(offer_counts), aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = colors, color = "black") +
  labs(title = "Frequency of Each Offer Type",
       x = "Offer Type",
       y = "Count") +
  scale_fill_manual(values = colors)

From the above observation, the BOGO and discount offer types have the highest counts, at nearly 66,000 and 62,000 respectively, while informational offers have the lowest count.

# Load the ggplot2 library
library(ggplot2)

# Create a scatterplot
ggplot(profile_new, aes(x = gender, y = income, color = gender)) +
  geom_jitter(width = 0.3, alpha = 0.7, size = 1) +
  labs(title = "Scatterplot of Income vs Gender", x = "Gender", y = "Income") +
  scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A")) +  # Custom colors
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.major.y = element_line(color = "gray90"),
        axis.text.x = element_text(angle = 45, hjust = 1))

  1. Male customers have a right-skewed income distribution, indicating that a larger proportion of male customers falls within the lower half of the income spectrum among the company’s customer base.

  2. Female customers have a significantly higher average income compared to customers of other genders. This disparity in income may be attributed to the assumption that female customers, on average, are older than customers in other gender groups.

  3. On average, female customers have an income of $71,000, while male customers have an average income of $61,000. Customers of other genders have an average income of $63,000.

Income vs Gender vs Age

# Load necessary libraries
library(plotly)
library(dplyr)

# Create a 3D scatterplot with Plotly
scatter_3d <- profile_new %>%
  plot_ly(x = ~age, y = ~income, z = ~gender, color = ~gender, colors = c("#E41A1C", "#377EB8", "#4DAF4A"),
          type = "scatter3d", mode = "markers",
          marker = list(size = 2)) %>%
  layout(scene = list(xaxis = list(title = "Age"),
                     yaxis = list(title = "Income"),
                     zaxis = list(title = "Gender")))

# Show the 3D scatterplot
scatter_3d
  • Income tends to increase with age for both men and women. This is evident from the fact that the scatterplot shows a general upward trend from left to right.

  • Another interesting thing we can observe is that female customers with higher incomes, on average, are older than customers in other gender groups.

The customer base consists of a larger proportion of males (57.2%) than females (41.3%), with a minor representation (1.4%) from customers identifying with other genders. Customer ages range from 18 to 101 years. Customer incomes span from 30k to 120k, with an average income of 65.4k, and the income distribution closely mirrors that of the general population. There is a logical relationship between age and cafe visits, as individuals between 46 and 75 years tend to have more available time to visit cafes. The income distribution also varies between genders, with females having higher average incomes.

DATA MODELLING

SMART Q1: How can we design a precise predictive model to classify customer responses to offers as successful or not?

Now that we have analyzed the dataset, we will build a model that predicts whether a user will respond to an offer. There are four possible scenarios:

  1. A user views the offer and completes it.

  2. A user views the offer but does not complete it.

  3. A user does not view the offer but completes it anyway (without prior knowledge of the offer’s existence).

  4. A user neither views nor completes the offer.

Since Starbucks is targeting users who will view the offer and then complete it, our prediction is a binary value: 1 if the user will view and complete the offer, and 0 otherwise.
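The four scenarios collapse into this binary target as follows; a minimal sketch on toy data (the column names here are illustrative):

```r
# Only "viewed AND completed" (scenario 1) counts as a success;
# the other three scenarios are labeled 0
events <- data.frame(viewed = c(1, 1, 0, 0), completed = c(1, 0, 1, 0))
events$offer_success <- as.integer(events$viewed == 1 & events$completed == 1)
events$offer_success  # 1 0 0 0
```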

In order to proceed with the prediction, we will need to create a new dataframe that will include the targeted features and the prediction column. The features that will be analyzed are:

Age, Income, Gender, Offer_type, Reward, Duration, Difficulty, and Channels.

A new column, “offer_success”, will be created to show whether a user successfully views and completes the offer.

From the datasets, the BOGO and discount offer types have clear completion criteria: completions can be found by looking at the event column for the value “offer completed” and then double-checking that the viewing, completion, and offer expiration times are consistent.

However, dealing with informational offers is different. Since informational offers are advertisements with no completion criterion, we need to define what counts as success for them.

One way is to look at all transactions and check whether a transaction occurred during an informational offer period. Such transactions are considered to be influenced by the offer, and thus the informational offer was successful. This is, of course, under the condition that a user has received and viewed the informational offer, and then proceeded to make a transaction.

To do this, we create one dataset containing all (user_id, offer_id) pairs of completed BOGO and discount offers, and another containing all (user_id, offer_id) pairs of successful informational offers.

Next, we will merge them into a single dataset of all successfully completed offers.

library(dplyr)

# Filter offer_received_df
offer_received_df <- transcript_new %>%
  filter(offer_received == 1) %>%
  select(offer_id, user_id, time) %>%
  rename(time_received = time)

# Filter offer_viewed_df
offer_viewed_df <- transcript_new %>%
  filter(offer_viewed == 1) %>%
  select(offer_id, user_id, time) 
  


# Filter offer_completed_df
offer_completed_df <- transcript_new %>%
  filter(offer_completed == 1) %>%
  select(offer_id, user_id, time) 

# Merge offer_completed_df and offer_viewed_df
complete_bogo_discount_df <- merge(offer_completed_df, offer_viewed_df, by = c('offer_id', 'user_id'))

# Merge with offer_received_df
complete_bogo_discount_df <- merge(complete_bogo_discount_df, offer_received_df, by = c('offer_id', 'user_id'))

complete_bogo_discount_df <- complete_bogo_discount_df %>%
  rename(time_completed = time.x, time_viewed = time.y)


complete_bogo_discount_df <- merge(complete_bogo_discount_df, new_portfolio, by = 'offer_id')

# Display the resulting data frame
head(complete_bogo_discount_df)
offer_id user_id time_completed time_viewed time_received offer_reward difficulty duration offer_type bogo discount informational email mobile social web
1 3841 624 588 576 5 20 10 discount 0 1 0 1 0 0 1
1 3841 624 588 336 5 20 10 discount 0 1 0 1 0 0 1
1 3883 504 576 504 5 20 10 discount 0 1 0 1 0 0 1
1 3925 60 18 0 5 20 10 discount 0 1 0 1 0 0 1
1 3925 60 18 168 5 20 10 discount 0 1 0 1 0 0 1
1 3925 60 228 0 5 20 10 discount 0 1 0 1 0 0 1
library(dplyr)


complete_bogo_discount_df <- complete_bogo_discount_df %>%
  mutate(time_expire = time_received + duration * 24)

# Display the resulting data frame
head(complete_bogo_discount_df)
offer_id user_id time_completed time_viewed time_received offer_reward difficulty duration offer_type bogo discount informational email mobile social web time_expire
1 3841 624 588 576 5 20 10 discount 0 1 0 1 0 0 1 816
1 3841 624 588 336 5 20 10 discount 0 1 0 1 0 0 1 576
1 3883 504 576 504 5 20 10 discount 0 1 0 1 0 0 1 744
1 3925 60 18 0 5 20 10 discount 0 1 0 1 0 0 1 240
1 3925 60 18 168 5 20 10 discount 0 1 0 1 0 0 1 408
1 3925 60 228 0 5 20 10 discount 0 1 0 1 0 0 1 240
library(dplyr)

complete_bogo_discount_df <- complete_bogo_discount_df %>%
  filter(
    time_received <= time_viewed,
    time_viewed <= time_completed,
    time_completed <= time_expire
  )

# Display the resulting data frame
head(complete_bogo_discount_df)
offer_id user_id time_completed time_viewed time_received offer_reward difficulty duration offer_type bogo discount informational email mobile social web time_expire
1 3841 624 588 576 5 20 10 discount 0 1 0 1 0 0 1 816
1 3925 60 18 0 5 20 10 discount 0 1 0 1 0 0 1 240
1 3926 42 24 0 5 20 10 discount 0 1 0 1 0 0 1 240
1 3984 462 438 408 5 20 10 discount 0 1 0 1 0 0 1 648
1 3995 240 174 0 5 20 10 discount 0 1 0 1 0 0 1 240
1 3995 240 174 168 5 20 10 discount 0 1 0 1 0 0 1 408

Having created the first dataset, we will repeat similar logic to create the second dataset.

library(dplyr)

# Dataframe holding the events where a transaction took place
transaction_df <- transcript_new %>%
  filter(transaction == 1) %>%
  select(user_id, time, amount)

# Merge transaction and offer_viewed dataframes
complete_info_df <- left_join(transaction_df, offer_viewed_df, by = 'user_id')

# Merge with offer_received_df
complete_info_df <- left_join(complete_info_df, offer_received_df, by = c('offer_id', 'user_id'))
complete_info_df <- complete_info_df %>%
  rename(time_transaction = time.x, time_viewed = time.y)


# Merge with new_portfolio
complete_info_df <- left_join(complete_info_df, new_portfolio, by = 'offer_id')

# Calculate offer expiration
complete_info_df <- complete_info_df %>%
  mutate(time_expire = time_received + duration * 24)

# Choose only informational offer
complete_info_df <- filter(complete_info_df, informational == 1)

#filter based on the mentioned criteria
complete_info_df <- complete_info_df %>%
  filter(
    time_viewed >= time_received,
    time_transaction <= time_expire,
    time_viewed <= time_transaction
  )


complete_bogo_discount_df <- mutate(complete_bogo_discount_df, offer_success = 1)
complete_info_df <- mutate(complete_info_df, offer_success = 1)


complete_bogo_discount_df <- complete_bogo_discount_df %>%
  select(user_id, offer_id, offer_success)

complete_info_df <- complete_info_df %>%
  select(user_id, offer_id, offer_success)


head(complete_info_df)
user_id offer_id offer_success
4665 4 1
4665 4 1
4598 6 1
4598 6 1
4598 6 1
4400 6 1
library(dplyr)
concat_r <- bind_rows(complete_bogo_discount_df, complete_info_df)

head(concat_r)
user_id offer_id offer_success
3841 1 1
3925 1 1
3926 1 1
3984 1 1
3995 1 1
3995 1 1
library(dplyr)


df1 <- offer_received_df %>%
  select(offer_id, user_id) %>%
  distinct()  # Ensure unique pairs

df2 <- concat_r %>%
  select(offer_id, user_id, offer_success)

all_clean_df <- left_join(df1, df2, by = c('offer_id', 'user_id'))

# Consider the other offers as unsuccessful
all_clean_df$offer_success <- ifelse(is.na(all_clean_df$offer_success), 0, all_clean_df$offer_success)


all_clean_df <- data.frame(all_clean_df, row.names = NULL)


# Subset Columns
model_df <- all_clean_df %>%
  select(user_id, offer_id, offer_success)

# Merge with 'portfolio' DataFrame
model_df <- left_join(model_df, new_portfolio, by = 'offer_id')
# Merge with 'profile' DataFrame
model_df <- left_join(model_df, profile_new, by = 'user_id')

# Drop Columns
model_df <- model_df %>%
  select(-became_member_on, -offer_type)

head(model_df)
user_id offer_id offer_success offer_reward difficulty duration bogo discount informational email mobile social web gender age income
4182 1 1 5 20 10 0 1 0 1 0 0 1 M 58 61000
4182 1 1 5 20 10 0 1 0 1 0 0 1 M 58 61000
4182 1 1 5 20 10 0 1 0 1 0 0 1 M 58 61000
4182 1 1 5 20 10 0 1 0 1 0 0 1 M 58 61000
4182 1 1 5 20 10 0 1 0 1 0 0 1 M 58 61000
4547 2 1 3 7 7 0 1 0 1 1 1 1 F 30 31000


# Create Integer Mapping for 'gender' Column
gender_levels <- unique(model_df$gender)
gender_map <- setNames(as.integer(seq_along(gender_levels)), gender_levels)

model_df$gender <- gender_map[model_df$gender]

# Normalize 'age' and 'income' Columns
normalize_data <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

model_df$age <- normalize_data(model_df$age)
model_df$income <- normalize_data(model_df$income)
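As a quick illustration of the min-max transform, assuming a toy age vector spanning the dataset’s observed range (18 to 101):

```r
# Min-max normalization maps the minimum to 0 and the maximum to 1
normalize_data <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- normalize_data(c(18, 60, 101))
round(normalized, 3)  # 0.000 0.506 1.000
```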

Drop the demographic columns (gender, age, income):

columns_to_drop <- c('gender', 'age', 'income')
model_df <- model_df[, !(names(model_df) %in% columns_to_drop)]

head(model_df)
user_id offer_id offer_success offer_reward difficulty duration bogo discount informational email mobile social web
4182 1 1 5 20 10 0 1 0 1 0 0 1
4182 1 1 5 20 10 0 1 0 1 0 0 1
4182 1 1 5 20 10 0 1 0 1 0 0 1
4182 1 1 5 20 10 0 1 0 1 0 0 1
4182 1 1 5 20 10 0 1 0 1 0 0 1
4547 2 1 3 7 7 0 1 0 1 1 1 1

Three classifier algorithms will be used:

  1. LinearSVC

  2. Decision Tree Classifier

  3. k-Nearest Neighbors

To evaluate a model, we will look at the F1 score. The F1 score is the harmonic mean of precision and recall, and conveys the balance between them.

Looking at precision alone would ignore false negatives, causing us to miss valuable customers who could potentially complete an offer.

Similarly, looking at recall alone would ignore false positives, which could lead us to send offers to everyone and flood users with offers they are not interested in.

For these reasons, the F1 score is the best choice in this case, as it balances the two.

# Load libraries
library(dplyr)
library(glmnet)
library(caret)

# Select relevant features based on feature importance
selected_features <- c("offer_reward", "difficulty", "duration", "bogo", "discount", "email", "mobile", "social", "web")
formula <- as.formula(paste("offer_success ~", paste(selected_features, collapse = " + ")))

# Split the data into training and testing sets
set.seed(123)  # Set seed for reproducibility
sample_index <- sample(nrow(model_df), 0.8 * nrow(model_df))
train_data <- model_df[sample_index, ]
test_data <- model_df[-sample_index, ]

formula <- as.formula("offer_success ~ offer_reward + difficulty + duration + bogo + discount + informational + email + mobile + social + web")

# Train the logistic regression model
model_lm_old <- glm(formula, data = train_data, family = "binomial")

# Make predictions on the test set
predictions <- predict(model_lm_old, newdata = test_data, type = "response")

# Convert predicted probabilities to binary predictions
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)

# Evaluate the model
conf_matrix <- table(binary_predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

conf_matrix_caret <- confusionMatrix(data = as.factor(binary_predictions), reference = as.factor(test_data$offer_success))

print(conf_matrix)
##                   
## binary_predictions      0      1
##                  0    345      0
##                  1    226 315211
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.9993"
cat("\n***Logistic regression***\n")
## 
## ***Logistic regression***
print(conf_matrix_caret)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0    345      0
##          1    226 315211
##                                         
##                Accuracy : 0.999         
##                  95% CI : (0.999, 0.999)
##     No Information Rate : 0.998         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.753         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.60420       
##             Specificity : 1.00000       
##          Pos Pred Value : 1.00000       
##          Neg Pred Value : 0.99928       
##              Prevalence : 0.00181       
##          Detection Rate : 0.00109       
##    Detection Prevalence : 0.00109       
##       Balanced Accuracy : 0.80210       
##                                         
##        'Positive' Class : 0             
## 

OBSERVATION:

In this logistic regression model, we aim to predict offer_success based on features such as offer_reward, difficulty, duration, and communication channels. Despite achieving a high accuracy of 99.93%, the model’s effectiveness is questionable due to several concerns:

  1. Class Imbalance: The dataset shows a significant imbalance between classes, which could lead to a bias towards predicting the majority class, thereby inflating accuracy.

  2. Sensitivity Concerns: The low sensitivity (60.42%) indicates the model’s limited ability to correctly identify true positives for the minority class.

  3. Potential Overfitting: The high accuracy might be indicative of overfitting, especially if the test set does not represent the overall population diversity.

  4. Metric Reliability: Sole reliance on accuracy is misleading in imbalanced datasets; alternative metrics like F1-score or AUC-ROC are recommended for a more comprehensive evaluation.

In summary, while the model shows high accuracy, its real-world applicability is limited by issues like class imbalance, potential overfitting, and an over-reliance on accuracy as the primary performance metric. Addressing these concerns is crucial for developing a more robust and reliable predictive model.
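Point 4 can be made concrete with the minority-class F1-score, computed here directly from the confusion matrix printed above (rows are predictions, columns the reference labels); a minimal sketch:

```r
# Minority-class (0) metrics from the confusion matrix above
tp <- 345   # predicted 0, actual 0
fp <- 0     # predicted 0, actual 1
fn <- 226   # predicted 1, actual 0

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)   # matches the reported sensitivity of 0.6042
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, F1 = f1), 4)
```

An F1 of roughly 0.75 tells a far less flattering story than the 99.93% accuracy, which is exactly why accuracy alone is unreliable here.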

# install.packages("ROSE")
library(ROSE)

library(dplyr)
library(glmnet)
library(caret)

selected_features <- c("offer_reward", "difficulty", "duration", "bogo", "discount", "informational", "mobile", "social", "web")
formula <- as.formula(paste("offer_success ~", paste(selected_features, collapse = " + ")))

set.seed(123)  # Set seed for reproducibility
sample_index <- sample(nrow(model_df), 0.8 * nrow(model_df))
train_data <- model_df[sample_index, ]
test_data <- model_df[-sample_index, ]

# Address Class Imbalance: Use ROSE for oversampling the minority class
oversampled_train_data <- ROSE(formula, data = train_data, p = 0.5, seed = 123)$data

# Train the logistic regression model with regularization (elastic net)
model_lm_new <- cv.glmnet(x = as.matrix(oversampled_train_data[, selected_features, drop = FALSE]),  # Ensure matrix format
                   y = oversampled_train_data$offer_success,
                   alpha = 0.5,  # alpha = 0.5 mixes the L1 and L2 penalties (elastic net)
                   family = "binomial")

# Make predictions on the test set
predictions <- predict(model_lm_new, newx = as.matrix(test_data[, selected_features, drop = FALSE]), s = "lambda.min", type = "response")

# Convert predicted probabilities to class labels at a 0.5 cutoff (reasonable now that the training classes are balanced)
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)

# Evaluate the model
conf_matrix <- table(binary_predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

conf_matrix_caret <- confusionMatrix(data = as.factor(binary_predictions), reference = as.factor(test_data$offer_success))

print(conf_matrix)
##                   
## binary_predictions      0      1
##                  0    542  37998
##                  1     29 277213
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.8796"
print("***Logistic regression with regularization***")
## [1] "***Logistic regression with regularization***"
print(conf_matrix_caret)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0    542  37998
##          1     29 277213
##                                         
##                Accuracy : 0.88          
##                  95% CI : (0.878, 0.881)
##     No Information Rate : 0.998         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.024         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.94921       
##             Specificity : 0.87945       
##          Pos Pred Value : 0.01406       
##          Neg Pred Value : 0.99990       
##              Prevalence : 0.00181       
##          Detection Rate : 0.00172       
##    Detection Prevalence : 0.12205       
##       Balanced Accuracy : 0.91433       
##                                         
##        'Positive' Class : 0             
## 
summary(model_lm_new)
##            Length Class  Mode     
## lambda     80     -none- numeric  
## cvm        80     -none- numeric  
## cvsd       80     -none- numeric  
## cvup       80     -none- numeric  
## cvlo       80     -none- numeric  
## nzero      80     -none- numeric  
## call        6     -none- call     
## name        1     -none- character
## glmnet.fit 13     lognet list     
## lambda.min  1     -none- numeric  
## lambda.1se  1     -none- numeric  
## index       2     -none- numeric

OBSERVATION:

In this improved logistic regression model, we addressed previous limitations through regularization and class imbalance correction. We employed the ROSE package for oversampling the minority class in our training data, leading to a more balanced dataset. Additionally, we utilized the glmnet package to apply elastic net regularization, a combination of L1 and L2 regularization techniques, enhancing the model’s ability to generalize and reducing the risk of overfitting.

The new model’s performance was evaluated using a confusion matrix and accuracy. The accuracy decreased to 87.96%, a more realistic figure considering the previously imbalanced nature of the dataset. Notably, the sensitivity increased to 94.92%, indicating a significantly improved ability to correctly identify true positives for the minority class. Specificity, though lower than the earlier model’s near-perfect value, remained solid at 87.95%, confirming the model still identifies most true negatives.

Key improvements in this model include:

  1. Handling Class Imbalance: Oversampling the minority class created a more balanced training dataset, allowing the model to learn more effectively from both classes.

  2. Regularization: The use of elastic net regularization helped in reducing overfitting by penalizing complex models, thus enhancing the model’s generalizability.

  3. Realistic Accuracy: The adjusted accuracy presents a more realistic view of the model’s performance in a balanced scenario.

  4. Improved Sensitivity: The substantial increase in sensitivity demonstrates the model’s enhanced capability in detecting the minority class.

Overall, these adjustments led to a more robust and reliable logistic regression model, better suited for practical applications due to its improved handling of class imbalance and overfitting.
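To see which predictors the elastic-net penalty actually retained, the coefficients at the cross-validated lambda can be inspected (a quick sketch, assuming model_lm_new from the chunk above is still in scope):

```r
# Coefficients at the lambda minimizing cross-validated deviance;
# predictors shrunk exactly to zero were effectively dropped by the L1 part
coef(model_lm_new, s = "lambda.min")
```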

# Plot ROC curve
library(pROC)
roc_curve <- roc(test_data$offer_success, as.numeric(predictions))  # glmnet predict() returns a matrix; coerce to a numeric vector
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)

# Calculate and print AUC value
auc_value <- auc(roc_curve)
legend("bottomright", legend = paste("AUC =", round(auc_value, 4)), col = "blue", lwd = 2)

OBSERVATION: The ROC-AUC curve displayed represents the performance of the improved logistic regression model. The curve illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate) across different thresholds. The AUC (Area Under the Curve) value is a summary measure of the curve; in this model, the AUC is 0.9727.

An AUC of 0.9727 suggests a high degree of separability between the positive and negative classes, indicating that the model is capable of distinguishing between the two with high accuracy. The closer the AUC is to 1, the better the model is at predicting 0s as 0s and 1s as 1s. An AUC value significantly higher than 0.5, which would represent a random guess, indicates that the model has a strong predictive performance.

In summary, the ROC-AUC curve and the high AUC value of 0.9727 reflect the model’s robustness and its improved ability to identify both classes accurately, signifying a substantial advancement from the initial logistic regression approach.
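Given the imbalance, the 0.5 cutoff used earlier is somewhat arbitrary; pROC can suggest the threshold that maximizes Youden's J (sensitivity + specificity - 1). A sketch, assuming roc_curve from the chunk above is still in scope:

```r
# Threshold maximizing Youden's J, with the sensitivity and
# specificity achieved at that cutoff
best <- coords(roc_curve, x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity"))
print(best)
```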

# Load libraries
library(dplyr)
library(rpart)
library(caret)


# Split the data into training and testing sets
set.seed(123)  # Set seed for reproducibility
sample_index <- sample(nrow(model_df), 0.8 * nrow(model_df))
train_data <- model_df[sample_index, ]
test_data <- model_df[-sample_index, ]

formula <- as.formula("offer_success ~ user_id + offer_id + offer_reward + difficulty + duration + bogo + discount + informational + email + mobile + social + web")

# Train the Decision Tree model
model_dtree_old <- rpart(formula, data = train_data, method = "class")

# Make predictions on the test set
predictions <- predict(model_dtree_old, newdata = test_data, type = "class")

# Evaluate the model
conf_matrix <- table(predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Use caret's confusionMatrix function for a detailed report
conf_matrix_caret <- confusionMatrix(data = as.factor(predictions), reference = as.factor(test_data$offer_success))

print(conf_matrix)
##            
## predictions      0      1
##           0    400     56
##           1    171 315155
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.9993"
print("***Decision Tree***")
## [1] "***Decision Tree***"
print(conf_matrix_caret)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0    400     56
##          1    171 315155
##                                         
##                Accuracy : 0.999         
##                  95% CI : (0.999, 0.999)
##     No Information Rate : 0.998         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.779         
##                                         
##  Mcnemar's Test P-Value : 3.84e-14      
##                                         
##             Sensitivity : 0.70053       
##             Specificity : 0.99982       
##          Pos Pred Value : 0.87719       
##          Neg Pred Value : 0.99946       
##              Prevalence : 0.00181       
##          Detection Rate : 0.00127       
##    Detection Prevalence : 0.00144       
##       Balanced Accuracy : 0.85017       
##                                         
##        'Positive' Class : 0             
## 

OBSERVATION:

The Decision Tree model was constructed to predict offer_success using features including user demographics, offer details, and the modes of communication used to send the offer. Despite an apparently high accuracy of 99.93%, similar to the logistic regression model, this measure may not reflect true predictive performance due to several underlying issues:

  1. Class Imbalance: The model was trained on a dataset with a significant class imbalance, which can bias it towards the majority class. This is evident in the confusion matrix, where predictions for the minority class (0) are substantially less frequent.

  2. Overfitting Potential: Decision Trees are prone to overfitting, especially when they are deep and complex. This can result in high accuracy on the training data but poor generalization to new data.

  3. Kappa Statistic: Although the Kappa statistic of 0.7786 suggests substantial agreement beyond chance, it is still important to consider the impact of class imbalance on this metric.

  4. Mcnemar’s Test: The p-value from Mcnemar’s test indicates a statistically significant difference between the model’s classification errors for the two classes, suggesting that the model may not perform equally well across both classes.

  5. Sensitivity and Specificity: The sensitivity of 70.05% indicates only a moderate ability to correctly identify true positives for the minority class, whereas the high specificity of 99.98% shows that the model is very good at identifying true negatives.
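The overfitting concern can be probed with rpart's complexity-parameter table: if the cross-validated error (xerror) stops improving while the tree keeps adding splits, the tree has grown past its useful depth. A sketch, assuming model_dtree_old from the chunk above is still in scope:

```r
# Cross-validated error by tree size
printcp(model_dtree_old)

# Prune back to the cp value with the lowest cross-validated error
best_cp <- model_dtree_old$cptable[which.min(model_dtree_old$cptable[, "xerror"]), "CP"]
model_dtree_pruned <- prune(model_dtree_old, cp = best_cp)
```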

# install.packages("ROSE")
library(ROSE)
library(rpart)

# Subset data with top features
top_features <- c("offer_reward", "difficulty", "duration", "bogo", "mobile", "social", "web")
train_data <- train_data[, c(top_features, "offer_success")]
test_data <- test_data[, c(top_features, "offer_success")]

# Set seed for reproducibility
set.seed(123)

# Define the model formula with top features
formula <- as.formula("offer_success ~ offer_reward + difficulty + duration + bogo + mobile + social + web")

# Apply random oversampling to the training data
oversampled_object <- ovun.sample(formula, data = train_data, method = "over", N = 2 * sum(train_data$offer_success == 1), seed = 123)

# Extract the balanced dataset from the oversampled object
train_data_balanced <- oversampled_object$data

# Train the Decision Tree model on the balanced data
model_dtree_new <- rpart(formula, data = train_data_balanced, method = "class", control = rpart.control(cp = 0.01))

# Make predictions on the test set
predictions <- predict(model_dtree_new, newdata = test_data, type = "class")

# Evaluate the model
conf_matrix <- table(predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

conf_matrix_caret <- confusionMatrix(data = as.factor(predictions), reference = as.factor(test_data$offer_success))

print(conf_matrix)
##            
## predictions      0      1
##           0    542  37998
##           1     29 277213
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.8796"
print("***Decision Tree with random oversampling***")
## [1] "***Decision Tree with random oversampling***"
print(conf_matrix_caret)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0    542  37998
##          1     29 277213
##                                         
##                Accuracy : 0.88          
##                  95% CI : (0.878, 0.881)
##     No Information Rate : 0.998         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.024         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.94921       
##             Specificity : 0.87945       
##          Pos Pred Value : 0.01406       
##          Neg Pred Value : 0.99990       
##              Prevalence : 0.00181       
##          Detection Rate : 0.00172       
##    Detection Prevalence : 0.12205       
##       Balanced Accuracy : 0.91433       
##                                         
##        'Positive' Class : 0             
## 

OBSERVATION:

The enhanced Decision Tree model was developed to address the shortcomings of the initial model, notably the class imbalance and potential for overfitting. To this end, we applied the ROSE package’s ovun.sample function to randomly oversample the minority class, creating a balanced representation of classes in the training data. We also refined the model by selecting the most relevant features and constrained tree growth through the complexity parameter (cp), which discards splits that do not sufficiently improve the fit.

The model’s accuracy post-improvement stands at 87.96%, lower than the initial model but more indicative of its true performance given the balanced dataset. The substantial improvement in sensitivity to 94.92% demonstrates the model’s enhanced capability in correctly identifying the minority class, which is critical in imbalanced datasets.

Furthermore, the application of pruning parameters in the decision tree has helped to mitigate overfitting by avoiding an overly complex tree that perfectly fits the training data at the expense of generalization.

In conclusion, the improved Decision Tree model represents a more accurate and reliable classifier than its predecessor. By addressing class imbalance through oversampling and incorporating regularization to combat overfitting, we have achieved a model that not only performs better on balanced data but also has a greater potential to generalize to unseen data. This has been substantiated by a more balanced accuracy, indicating improved model performance across both classes.

CONCLUSION: Based on the provided information and the AUC values from the ROC curves, the improved logistic regression model with an AUC of 0.9727 outperforms the improved Decision Tree model with an AUC of 0.9143. The higher AUC value for the logistic regression model indicates a superior ability to distinguish between the successful and unsuccessful offers. This is a significant metric in evaluating model performance, particularly in classification tasks where the objective is to maximize the correct identification of both positive and negative classes while minimizing the error rates.

Moreover, the logistic regression model displayed a balanced accuracy and an improved sensitivity over the initial model, which further suggests its enhanced capability to generalize and perform well on unseen data. In conclusion, the improved logistic regression model is deemed better in terms of the metrics used for evaluation in this context.

# Install and load necessary packages if not already installed
# install.packages(c("pROC", "caret", "ROCR"))
library(pROC)
library(caret)
library(ROCR)

# Function to plot ROC curve
plotROC <- function(predictions, labels, main = "ROC Curve") {
  roc_curve <- roc(labels, as.numeric(predictions))  # hard class labels give a one-point ROC; the AUC then equals balanced accuracy
  auc_value <- auc(roc_curve)
  
  plot(roc_curve, main = main, col = "blue", lwd = 2)
  abline(a = 0, b = 1, lty = 2, col = "red")  # Diagonal line for reference
  legend("bottomright", legend = paste("AUC =", round(auc_value, 4)), col = "blue", lwd = 2)
}

# Plot ROC curve
plotROC(predictions, test_data$offer_success, main = "ROC Curve for Decision Tree Model")

OBSERVATION:

The ROC-AUC curve presented for the improved Decision Tree model demonstrates its classification strength with an AUC of 0.9143. This value indicates a strong ability of the model to differentiate between the classes of ‘offer success’. An AUC close to 1.0 suggests excellent model performance, and at 0.9143, the model is considered very good, particularly in comparison to a model with random guessing (AUC of 0.5). This AUC value signifies that the model has a high true positive rate across various thresholds while maintaining a low false positive rate, showcasing its effectiveness in the classification task after the enhancements were made.

SMART Q2: Can we predict which customer segment a new customer is likely to belong to based on their demographics?

cluster_data <- merged_data %>%
  group_by(id) %>%
  summarise(
    age = first(age),
    income = first(income),
    num_transactions = sum(transaction, na.rm = TRUE),
    total_offers_received = sum(offer_received),
    total_offers_viewed = sum(offer_viewed),
    total_offers_completed = sum(offer_completed),
    membership_duration = first(membership_duration),
    gender = first(gender)
  )
scaled_data <- scale(cluster_data[, c("age", "income", "num_transactions", "total_offers_received", "total_offers_viewed", "total_offers_completed", "membership_duration")])

# Determine the optimal number of clusters (k) using the elbow method
set.seed(123)  # k-means uses random starting centers
wss <- numeric(10)
for (i in 1:10) {
  # nstart > 1 stabilizes the WSS curve across random initializations
  wss[i] <- kmeans(scaled_data, centers = i, nstart = 10)$tot.withinss
}

# Plot the elbow method to find the optimal k
plot(1:10, wss, type = "b", xlab = "Number of Clusters (k)", ylab = "Within Sum of Squares (WSS)")

# Based on the plot, choose the optimal k (elbow point)
optimal_k <- 4  # Adjust this based on the plot

# Print the result of the elbow method
cat("Optimal number of clusters (k) based on the elbow method:", optimal_k, "\n")
## Optimal number of clusters (k) based on the elbow method: 4
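The elbow point can be ambiguous to read off the plot, so a complementary check is the average silhouette width from the cluster package, sketched here on a random subsample because the pairwise distance matrix for all 17,000 customers would be very large (an illustrative sketch under that assumption, not part of the original pipeline):

```r
library(cluster)

set.seed(123)
idx <- sample(nrow(scaled_data), 2000)  # subsample to keep dist() tractable
d <- dist(scaled_data[idx, ])

# Average silhouette width per candidate k; higher is better
sil_width <- sapply(2:8, function(k) {
  km <- kmeans(scaled_data[idx, ], centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:8, sil_width, type = "b", xlab = "Number of Clusters (k)",
     ylab = "Average silhouette width")
```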
# Apply K-Means clustering with the optimal k
kmeans_model <- kmeans(scaled_data, centers = optimal_k)

# Add the cluster labels to the original dataset
cluster_data$cluster <- as.factor(kmeans_model$cluster)

# Analyze the characteristics of each cluster (non-numeric columns such as the customer id average to NA and can be ignored)
cluster_profiles <- cluster_data %>%
  group_by(cluster) %>%
  summarise_all(mean)

print(cluster_profiles)
## # A tibble: 4 × 10
##   cluster    id   age income num_transactions total_offers_received
##   <fct>   <dbl> <dbl>  <dbl>            <dbl>                 <dbl>
## 1 1          NA 113.  64566.             6.48                  4.58
## 2 2          NA  61.6 86837.             6.59                  4.83
## 3 3          NA  50.5 57117.             5.64                  3.84
## 4 4          NA  49.7 52688.            14.3                   4.90
## # ℹ 4 more variables: total_offers_viewed <dbl>, total_offers_completed <dbl>,
## #   membership_duration <dbl>, gender <dbl>
library(caret)
library(e1071)


set.seed(123)  # Set seed for reproducibility
index <- createDataPartition(cluster_data$cluster, p = 0.8, list = FALSE)
train_data <- cluster_data[index, ]
test_data <- cluster_data[-index, ]


table(cluster_data$cluster)

# Train the SVM on the training split, weighting classes inversely to
# their frequency so that smaller clusters are not ignored
tab <- table(train_data$cluster)
class_wts <- setNames(as.numeric(nrow(train_data) / (length(tab) * tab)), names(tab))
svm_model <- svm(cluster ~ age + income + num_transactions + total_offers_received + total_offers_viewed + total_offers_completed + membership_duration + gender,
                 data = train_data,
                 kernel = "radial",
                 class.weights = class_wts)

# Make predictions on the test set
predictions <- predict(svm_model, newdata = test_data)

# Evaluate the model
confusion_matrix <- table(predictions, test_data$cluster)
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy: ", accuracy))

Observations: The confusion matrix demonstrates effective clustering, with high accuracy (about 97%) and few misclassifications across the four groups.

# Extract elements from the 4-class confusion matrix, treating cluster 2
# as the positive class. Note the orientation of the table: rows are
# predictions and columns are the true clusters.
tp <- confusion_matrix[2, 2]  # predicted cluster 2, truly cluster 2
fp <- confusion_matrix[1, 2]  # predicted cluster 1, truly cluster 2

# Calculate precision
precision <- tp / (tp + fp)

# Print precision
print(paste("Precision: ", precision))
## [1] "Precision:  0.992814371257485"

Observations: The high precision indicates that positive predictions are reliable and the false-positive rate is low.

tp <- confusion_matrix[2, 2]  # predicted cluster 2, truly cluster 2
fn <- confusion_matrix[2, 1]  # predicted cluster 2, truly cluster 1
tn <- confusion_matrix[1, 1]  # predicted cluster 1, truly cluster 1
fp <- confusion_matrix[1, 2]  # predicted cluster 1, truly cluster 2

# Calculate sensitivity (recall)
sensitivity <- tp / (tp + fn)

# Calculate specificity
specificity <- tn / (tn + fp)

# Print sensitivity and specificity
print(paste("Sensitivity: ", sensitivity))
## [1] "Sensitivity:  1"
print(paste("Specificity: ", specificity))
## [1] "Specificity:  0.994579945799458"

Observations: The model shows excellent sensitivity and specificity, indicating balanced performance in detecting both true positives and true negatives.
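Because this is a four-class problem, extracting single cells as above only describes one pair of clusters; caret's confusionMatrix reports per-cluster one-vs-rest metrics in a single call (a sketch, assuming predictions and test_data from the SVM chunk are in scope):

```r
# One-vs-rest sensitivity, specificity and precision for every cluster
cm <- confusionMatrix(data = predictions, reference = test_data$cluster)
cm$byClass[, c("Sensitivity", "Specificity", "Precision")]
```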

CONCLUSION

SMART Q1

To design a precise predictive model for classifying customer responses to offers as successful or not, we implemented and iteratively improved logistic regression and decision tree models. By addressing class imbalance with techniques like oversampling and applying regularization methods to reduce overfitting, we enhanced the models’ predictive accuracies and generalization capabilities. The logistic regression model, after refinement, yielded an AUC of 0.9727, indicating a high ability to differentiate between classes and showing strong model performance. Meanwhile, the improved decision tree model achieved an AUC of 0.9143, which, while very good, was less indicative of an optimal performance than the logistic regression model.

Evaluating both models’ AUC values, sensitivities, specificities, and overall accuracies, the improved logistic regression model emerged as the more precise classifier for predicting customer responses. The combination of data preprocessing, feature selection, model evaluation, and fine-tuning led to the development of a robust predictive model that meets the objective of accurately classifying customer responses to offers.

SMART Q2

  1. Using the Elbow method, we identified four as the optimal number of clusters and then applied K-means clustering to segment the full customer dataset.
  2. An SVM model was then trained to predict cluster membership. The confusion matrix shows excellent accuracy, with precision, sensitivity, and specificity all above 99%. These are very encouraging results.

To conclude, these metrics show that the model performs well in detecting both positive and negative information, and it is very effective at classifying data into the appropriate clusters with few errors. This degree of precision and accuracy shows that the model is strong and trustworthy when it comes to generating predictions using the dataset.
